1 Exploration and data analysis

We will be using the CWNS data set from 2012 to perform regressions on the data. We will start by preparing the data for regression analysis.

Firstly, we are interested in creating a model that can help us to understand and possibly predict the investment needs for a particular facility. The outcome variables in this case would be TOTAL_OFFICIAL_NEED.

From the histogram above, we see that the original untransformed data is skewed to the left. This is mainly because a lot of the facilities serve a fair amount of people and entities, and there are a few facilities that serve a large population of people and entities and consequently have very high infrastructure needs e.g Flushing, New York. We can create a new variable, LOG_TOTAL_OFFICIAL_NEED, which will be a log (base 10) transformation of TOTAL_OFFICIAL_NEED. The reason for doing this transformation is to make the data more symmetric, and at the same time reduce variability in the output.

We also create another variable, RES_BURDEN, which is a measure of the residential burden of a facility ($ / person). This is calculated by dividing the Total Need by the maximum value between Residents presently receiving treatment and residents presently receiving collection. This would give us a sense of how much the need is spread out per person.

Just like the Total Official Need, we see that the data is also slightly skewed to the left. The figure on the right shows that the log (base10) transformation of RES_BURDEN leads to a more symmetric variable. Moving forward we will conduct regressions with both variables and compare their respective models’ performance.

Some additional variable were also log-transformed to correct for the skewness in some of the variables. While it is not a requirement, having data with a more symmetric distribution may yield a better model performance. These variables transformed were:
1. Population size variables. 2. Income variables.

2 Bivariate Regressions

At this stage, we start our regression analysis by conducting simple bivariate regressions on the some of the explanatory variables and need variables. I will use five explanatory variables that can serve as proxies for the following dimensions:
1. Demographic : PCTWHITE10, POP10, NET_POP_DIFFERENCE
2. Socio-economic: MEDINC09
3. Physical: AWATER, REGION

2.1 Total Official Need vs Population (2010)

We will start off by doing a regression of total official need and population. The expectation should be that as the population served by a facility goes up, then the total need should also go up since more strain is placed on a facility. Therefore, the expectation should be that we will have a positive slope for our bivariate regression.

Let’s look at some scatterplots and see potential relationships between need and population, and also the corresponding transformed variabLes.

From the plots, it would seem that there may be a relationship between LOG_TOTAL_OFFICIALNEED and LOG_POP10, and also between TOTAL_OFFICAL_NEED and LOG_POP10, whereby the first combination looks to be I will do a regression on both sets of data and see which one yields a better model.

From the table above, we see that the model with log transformed variables performs better. The \(\beta_1\) value is statistically significant because the p-value is 0. However, the model can explain only 20.91% of the variance, therefore it offers very limited predictive power.

It could be the case that there are locations with high population and low facility needs, and vice versa. We can observe this from the scatterplot of Total Official Need vs Population above. Such an unusual relationship may be throwing off the regression calculation hence diminishing its predictive power.

Below is a plot for the regression line of both models.

From the plots and the \(R^2\) value, the regression line in the second plot offers a better fit. We can open diagnostic plots and see if we have a chance to improve the model (see appendix)

The residuals plot indicates that the residuals are randomly distributed and there is no resulting pattern. The qqplot also suggest normality of the model, with the exception of a few points that have been identified as outliers (they have been identified by their numbers).

However, when we look at the leverage plots, we see that there are few data points that are presented as outliers. The leverage points affected our model, so we’ll see if we can improve the model by dropping a point that has appeared constantly in all the regression plots.

The model did not significantly improve. Therefore we can choose either model.

2.2 Total Official Need vs Percent White

It would seem that a lot of the data is concentrated in the right side of the plot. This could be because the PCTWHITE10 variable is skewed. This is to be expected since majority of the facilities will be located in locations with a white majority populace. We can try to correct the skewness by also taking a log transformation of PCTWHITE10.

The attempt to make PCTWHITE10 symmetric did not reduce the skewness, therefore we will use the base values for PCTWHITE10 for the regression.

The reason this explanatory variable was chosen was that there is usually a trend where facilities and amenities, regardless of their use, tend to be better in majority white locations with the exception of some locations e.g rural America. Therefore, it could be the case that facilities in largely white locations could have less need.

The table below shows the regression results:

From the results of the modle above, we see that the model is indeed statistically significant with p-values well below 0.05. However the \(R^2\) value of 0.04367 suggests that the model cannot account for a lot of the variance. This result should not come of as surprising, since most facilities are located in regions with a majority white populace. The line representing mean of of the log of Total Need shows that the different need values are spread almost equally on both sides of the mean.

For an exploration based on the racial makeup of a facility’s location, it may be more useful to encode the data with different indicator variables. This is beyond this report’s scope because we do not have enough data on the racial makeup of the facilities and a mutually exclusive assumption may be unfounded.

2.3 Total Official Need vs Median Income

The next predictor for the Total Official Need will be median income. Intuitively, we could expect that the infrastructure need in high medianc income locations should be low because the regions are well maintained and have sufficient revenue streams. We will use a scatterplot to observe this relationship, and also draw a histogram of MEDINC09 to check the trend of the data.

I chose MEDINC09 over MEAINC09 because the mean can be affected by outliers, as already demonsrated in boxplots and previous regressions. Therefore the median will be a sufficient proxy for socio-economic status of a facility’s location.

The plots show that the MEDINC09 data is not heavily skewed, therefore we do not need to perform any additional transformations. We proceed to do the regression. In as much as we can use the log transformation for the data, the improvement will be marginal.

Below is a plot of the regressoin line and a scatterplot of the data.

We can see from the tables that we obtained a statistically significant model, whose regression plots meet the assumptions of a regression model. However, both model performs poorly and has a poor explanatory power i.e \(R^2 = 0.05\) and \(R^2 = 0.004\). The regression line is almost close to the mean value of the dependent variable in both plot, therefore we do not gain any new insights as to the relationship between total official need and income.

The relationship between median income and total official need seems weak, therefore it will be unhelpful to try and predict without

2.4 Total Official Need vs Area of water

The next dependent variable to investigate is the area of land covered by water in the facility’s location. Based on the analysis in the previous homework, facilities that emptied into water bodies tend to have higher needs than facilites that did not empty into water bodies. The other relationship to consider is if the TMDL facilities also tend to be located in regions with large aread under water. We can see that using a boxplot that has been scaled to log values.

The boxplots demonstrate that difference is slight, however, the average value is higher in TMDL facilities. With this in mind, we could assume that TMDL facilities are located in facilities with large areas covered by water. Therefore, these facilites will be the ones with the greatest need. We will test this assumption using the regression.

The plots also show that the AWATER is also skewed to the right, therefore we will use the transformed variable to control for variance

We can look at the scatterplots of the data below:

The regression yield models whose explanatory variables are statistically significant, however, their \(R^2\) values (0.04 and 0.004) respectively are too low, and this means that this model cannot explainn alot of the variance. This can also be observed from the plots with the regression lines above.

On a national level the model did not yield much. A possible route to yield more meaninful result is to compare data points that may have similarities e.g. across regions or peer cities.

2.5 Residential Burden vs Population(2010)

Before moving on to the regressions, we will be using the log transformation of POP10. This is because of the observation made in earlier regressions whereby the log transformed variable yielded better results. This is because the log transformation also reduces the skewed nature of th explanatory variable.

We proceed witht he explanatory variable

## [1] "R squared value of Log Burden vs Log pop10:  0.209480754298962"
## [1] "R squared value of Burden vs Log pop10:  0.0474387911673893"

From the scatterplots above, it would seem that the plots look very similar to the ones generated by regressing population against total official need. The \(R^2\) values of the second model (Residential Burden vs Log Population 2010) is also very low, to the point whereby the model is not doing better than predicting the mean at all values. The first model, with an \(R^2\) value of 0.209 is a stronger model, however, around 80% of the variance is unexplained, therefore we cannot adopt this model for predictive purposes. However, we can say, though to a limited extent, that the higher the population, the higher the need.

The strong similarities between the outcome of the two dependent variable suggest that there might be a trend between the total official need and residential burden; after all the latter is a derivative of the former. We can investigate this using a plot:

There is a linear relationship between the two variables. Therefore, any regression done on total official need will yield more or less that same results as residential burden since the latter is a ‘linear’ transformation of the former. We can show this with a more mathematical statement i.e the correlation coefficient:

## Correlation coefficient for Residential Burden and Total Official Need:  1

3 Multivariate Regressions

Based on the conclusion from the previous section, I will be doing regressions exclusively on TOTAL_OFFICIAL_NEED and the log transformation.

The bivariate yielded some interesting results, however, there were not many useful insights from the bivariate analysis, therefore we try the multivariate regression. In addition to the explanatory variables used above, I will also make use of MS4 categorical variable for the model.

The first step will be to expand some of our variables. We had some categorical variables, which could be useful in the multivariate regressin. It could be the case that the region in which a facility is located could help predict/explain the needs of the facility. Fot this, we will use the package fastDummies. This will create 9 additional variables corresponding to the regions. The region being ommited is region 5, as it is the most represented region in the dataset. The choice of region to ommit when creating the variable is completely arbitrary. I selected region 5, because the package provides an option of having the omitted region as region 5.

3.1 Multivariate without categorical variables

We will begin the multivariate analysis by considering continuous variables in the dataset. For this analysis I will consider the most recent data across various variables. Therefore, I will not make use of old data. The only exception is finding differences in yearly differences. The first set of continuouse data I will be regressing on are:
+ PCTWHITE10
+ POP10
+ MEDINC09
+ AWATER
+ NET_POP_DIFFERENCE
+ ALAND
+ AWATER
+ PRES_RES_REC_TRMT
+ PRES_N_RES_REC_TRMT

I omitted PRES_RES_REC_COLLCTN from the list of continuous variables because it is strongly correlated with PRES_RES_REC_TRMT with a correlation coefficient of 0.9096743. There are some explanatory variables that are correlated with the dependent variable and with each other, however we would want to proceed with the regression, then remove the variables if we have to.

Below is the regression output for TOTAL_OFFICIAL_NEED and the listed explanatory variables.

## 
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + MEDINC09 + 
##     AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + PRES_RES_REC_TRMT + 
##     PRES_N_RES_REC_TRTM, data = final_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.475e+09 -1.007e+07 -3.075e+06  3.911e+06  4.485e+09 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.951e+07  1.590e+07  -1.856  0.06348 .  
## PCTWHITE10          -1.871e+05  9.623e+04  -1.944  0.05189 .  
## LOG_POP10            1.387e+07  2.952e+06   4.698 2.68e-06 ***
## MEDINC09            -2.256e+02  1.132e+02  -1.994  0.04618 *  
## AWATER              -1.455e-03  1.990e-03  -0.731  0.46496    
## NET_POP_DIFFERENCE  -7.424e+01  2.211e+01  -3.358  0.00079 ***
## ALAND               -9.077e-04  3.997e-04  -2.271  0.02316 *  
## PRES_RES_REC_TRMT    7.443e+02  1.631e+01  45.626  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.450e+03  9.236e+01  26.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 110600000 on 7069 degrees of freedom
## Multiple R-squared:  0.3916, Adjusted R-squared:  0.3909 
## F-statistic: 568.8 on 8 and 7069 DF,  p-value: < 2.2e-16

Below is the regression output for LOG_TOTAL_OFFICIAL_NEED and the listed explanatory variables.

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5631 -0.4633  0.0291  0.4703  2.2894 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.468e+00  9.786e-02  45.660  < 2e-16 ***
## PCTWHITE10          -1.986e-03  5.923e-04  -3.354 0.000801 ***
## LOG_POP10            4.228e-01  1.817e-02  23.272  < 2e-16 ***
## MEDINC09             5.061e-07  6.964e-07   0.727 0.467382    
## AWATER               7.076e-12  1.225e-11   0.578 0.563535    
## NET_POP_DIFFERENCE   1.574e-07  1.361e-07   1.157 0.247471    
## ALAND               -1.366e-12  2.460e-12  -0.555 0.578796    
## PRES_RES_REC_TRMT    2.372e-06  1.004e-07  23.621  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.137e-06  5.684e-07   3.759 0.000172 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6809 on 7069 degrees of freedom
## Multiple R-squared:  0.2843, Adjusted R-squared:  0.2835 
## F-statistic:   351 on 8 and 7069 DF,  p-value: < 2.2e-16

From the regression outputs above, we see that the model with TOTAL_OFFICIAL_NEED as the dependent variable performs better. However, in both models, there are variables that have high p-values, thus negatively affecting the \(R^2\) value and violating some model building assumptions. We will then proceed to perform a stepwise regression to obtain the best model.

Before proceeding to the stepwise regressions, we can see the VIF tables for the two models and see if we can improve the model and try comparing this with the models produced by stepwise regressions.

VIF table for TOTAL_OFFICIAL_NEED model:

##          PCTWHITE10           LOG_POP10            MEDINC09 
##              1.2327              2.1333              1.5458 
##              AWATER  NET_POP_DIFFERENCE               ALAND 
##              1.0744              1.5599              1.2339 
##   PRES_RES_REC_TRMT PRES_N_RES_REC_TRTM 
##              1.2563              1.1201

VIF table for LOG_TOTAL_OFFICIAL_NEED model:

##          PCTWHITE10           LOG_POP10            MEDINC09 
##              1.2327              2.1333              1.5458 
##              AWATER  NET_POP_DIFFERENCE               ALAND 
##              1.0744              1.5599              1.2339 
##   PRES_RES_REC_TRMT PRES_N_RES_REC_TRTM 
##              1.2563              1.1201

Both models have variables with relatively small VIF scores. Therefore both models do not suffer excessively from multicollinearity. However, there are variables with high p-values that suggest we can improve the performance of the model, through model selection. It must be noted from the residual plots(see appendix), that the models with TOTAL_OFFICIAL_NEED have residuals that are not normally distributed. Therefore these models do not satisfy the assumptions necessary for a linear regression model. The rest of the multivariate analysis will use TOTAL_OFFICIAL_NEED.

3.1.1 Improving the model Using Stepwise Regression

The stepwise regression will select variable based on AIC. Below is the summary table for the stepwise regression for TOTAL_OFFICIAL_NEED

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6373 -0.4605  0.0297  0.4691  2.2792 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.415e+00  9.076e-02  48.647  < 2e-16 ***
## PCTWHITE10          -1.892e-03  5.668e-04  -3.339 0.000846 ***
## LOG_POP10            4.384e-01  1.365e-02  32.120  < 2e-16 ***
## PRES_RES_REC_TRMT    2.377e-06  1.000e-07  23.759  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.104e-06  5.667e-07   3.713 0.000206 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6808 on 7073 degrees of freedom
## Multiple R-squared:  0.2841, Adjusted R-squared:  0.2837 
## F-statistic: 701.7 on 4 and 7073 DF,  p-value: < 2.2e-16

VIF for stepwise regression with LOG_TOTAL_OFFICIAL_NEED as the dependent variable:

VIF
x
PCTWHITE10 1.129
LOG_POP10 1.204
PRES_RES_REC_TRMT 1.248
PRES_N_RES_REC_TRTM 1.114

From the regression summary table, and the VIF table, we obtain a good model that can explain around 28.4% of the variance. While the model cannot be used for prediction, it gives us a hint of some variables that are jointly and independently statistically significant.

Can we improve the model if we drop the intercept?

When we think about our dependent variable i.e. TOTAL_OFFICIAL_NEED or the log transformed dependent variable, we would expect that if a facility existed in a location that does not have a population, or even land or water, then it would not have any need whatsoever. Therefore, we will try another regression that will omit the intercept.

** Removing the intercept **

Summary of LOG_TOTAL_OFFICIAL_NEED:

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + 0, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9573 -0.4806  0.0475  0.5471  3.7543 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## PCTWHITE10           1.638e-02  4.946e-04  33.117  < 2e-16 ***
## LOG_POP10            1.105e+00  1.177e-02  93.834  < 2e-16 ***
## MEDINC09            -5.839e-06  7.765e-07  -7.520 6.15e-14 ***
## AWATER              -3.970e-11  1.389e-11  -2.858 0.004275 ** 
## NET_POP_DIFFERENCE  -1.760e-06  1.473e-07 -11.947  < 2e-16 ***
## ALAND                1.315e-11  2.775e-12   4.739 2.19e-06 ***
## PRES_RES_REC_TRMT    1.936e-06  1.137e-07  17.022  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.409e-06  6.468e-07   3.725 0.000197 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7748 on 7070 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9857 
## F-statistic: 6.092e+04 on 8 and 7070 DF,  p-value: < 2.2e-16
VIF table
VIF
x
PCTWHITE10 1.233
LOG_POP10 2.133
MEDINC09 1.546
AWATER 1.074
NET_POP_DIFFERENCE 1.560
ALAND 1.234
PRES_RES_REC_TRMT 1.256
PRES_N_RES_REC_TRTM 1.120

By dropping the intercept term, we see a drastic improvement in performance. With an \(R^2\) value of 0.986, the model performs well and can account for alot of the variance. The F-statistic also shows that the \(R^2\) value is statistically significant. A look at the residual plots (see appendix) demonstrates that the residuals are normally distributed. However, we there is also a semblance of a trend in the residuals therefore we should take this result with a grain of salt.

The stepwise regression did not yield much of a result based on its VIF table and residual plots. Therefore our best model was the one yielded from the stepwise regression of the model that contained the intercept.

3.1.2 Including categorical variables

There are several categorical variables in the dataset that could provide a better model since the provide more information points. These are:
+ REGION
+ TMDL_INDICATOR
+ MS4

In this model, I will only use LOG_TOTAL_OFFICIAL_NEED because it is an easier dependent variable to work with because it has less variance. As mentioned earlier, because regions has 10 options, we had to create 9 dummy variables to accomodate this data. The ommitted region i.e. region 5, it will correspond to a value of 0 across the 9 dummy variables.

Below is the result of regressing all of the continuous and categorical variables.

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + 
##     EPA_REGION_1 + EPA_REGION_2 + EPA_REGION_3 + EPA_REGION_4 + 
##     EPA_REGION_6 + EPA_REGION_7 + EPA_REGION_8 + EPA_REGION_9 + 
##     EPA_REGION_10 + GROWING_CITY, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4433 -0.4458  0.0222  0.4517  2.2226 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.617e+00  1.283e-01  43.776  < 2e-16 ***
## PCTWHITE10          -2.676e-04  6.540e-04  -0.409 0.682481    
## LOG_POP10            3.301e-01  1.928e-02  17.123  < 2e-16 ***
## MEDINC09             2.774e-06  7.552e-07   3.674 0.000241 ***
## AWATER               6.901e-12  1.187e-11   0.581 0.561153    
## NET_POP_DIFFERENCE   2.592e-07  1.388e-07   1.867 0.061911 .  
## ALAND               -7.259e-12  2.867e-12  -2.532 0.011369 *  
## PRES_RES_REC_TRMT    2.118e-06  9.795e-08  21.627  < 2e-16 ***
## PRES_N_RES_REC_TRTM  1.387e-06  5.483e-07   2.530 0.011414 *  
## MS4                 -5.794e-01  3.062e-02 -18.925  < 2e-16 ***
## TMDL_dummy           1.845e-01  4.180e-02   4.412 1.04e-05 ***
## EPA_REGION_1         1.881e-01  4.182e-02   4.498 6.97e-06 ***
## EPA_REGION_2         1.074e-01  3.894e-02   2.757 0.005846 ** 
## EPA_REGION_3         2.433e-01  2.954e-02   8.239  < 2e-16 ***
## EPA_REGION_4         3.442e-01  3.004e-02  11.460  < 2e-16 ***
## EPA_REGION_6         8.888e-02  3.101e-02   2.866 0.004173 ** 
## EPA_REGION_7        -5.880e-02  2.818e-02  -2.087 0.036966 *  
## EPA_REGION_8         1.312e-01  3.880e-02   3.381 0.000726 ***
## EPA_REGION_9         5.239e-01  5.789e-02   9.049  < 2e-16 ***
## EPA_REGION_10        2.665e-01  4.772e-02   5.585 2.43e-08 ***
## GROWING_CITY         1.152e-02  1.968e-02   0.585 0.558380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6508 on 7057 degrees of freedom
## Multiple R-squared:  0.3473, Adjusted R-squared:  0.3454 
## F-statistic: 187.7 on 20 and 7057 DF,  p-value: < 2.2e-16

The F-statistic shows that the variables used in the model are statistically significant depsite some of the variables having high p-values. We can check the VIF for the regression below:

VIF
x
PCTWHITE10 1.646
LOG_POP10 2.629
MEDINC09 1.990
AWATER 1.105
NET_POP_DIFFERENCE 1.777
ALAND 1.835
PRES_RES_REC_TRMT 1.309
PRES_N_RES_REC_TRTM 1.141
MS4 1.165
TMDL_dummy 1.117
EPA_REGION_1 1.228
EPA_REGION_2 1.284
EPA_REGION_3 1.420
EPA_REGION_4 2.172
EPA_REGION_6 1.934
EPA_REGION_7 1.678
EPA_REGION_8 1.486
EPA_REGION_9 1.879
EPA_REGION_10 1.256
GROWING_CITY 1.328

The VIF table shows that multicollinearity does not significantly affect our model. However, an inspection of the regression’s summary table shows that there are some variables that have high p-values which suggests that the model can be improved. We can use stepwise regression to do some variable selection. The summary table of the best model from stepwise regression is:

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ LOG_POP10 + MEDINC09 + 
##     NET_POP_DIFFERENCE + ALAND + PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + 
##     MS4 + TMDL_dummy + EPA_REGION_1 + EPA_REGION_2 + EPA_REGION_3 + 
##     EPA_REGION_4 + EPA_REGION_6 + EPA_REGION_7 + EPA_REGION_8 + 
##     EPA_REGION_9 + EPA_REGION_10, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4444 -0.4435  0.0209  0.4506  2.2236 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.569e+00  9.696e-02  57.435  < 2e-16 ***
## LOG_POP10            3.349e-01  1.815e-02  18.455  < 2e-16 ***
## MEDINC09             2.882e-06  7.301e-07   3.947 7.97e-05 ***
## NET_POP_DIFFERENCE   2.477e-07  1.379e-07   1.796 0.072551 .  
## ALAND               -6.919e-12  2.820e-12  -2.453 0.014180 *  
## PRES_RES_REC_TRMT    2.120e-06  9.733e-08  21.784  < 2e-16 ***
## PRES_N_RES_REC_TRTM  1.410e-06  5.469e-07   2.578 0.009969 ** 
## MS4                 -5.776e-01  3.053e-02 -18.918  < 2e-16 ***
## TMDL_dummy           1.851e-01  4.177e-02   4.432 9.49e-06 ***
## EPA_REGION_1         1.884e-01  4.179e-02   4.509 6.62e-06 ***
## EPA_REGION_2         1.092e-01  3.875e-02   2.819 0.004825 ** 
## EPA_REGION_3         2.425e-01  2.945e-02   8.235  < 2e-16 ***
## EPA_REGION_4         3.507e-01  2.760e-02  12.708  < 2e-16 ***
## EPA_REGION_6         9.477e-02  2.885e-02   3.285 0.001026 ** 
## EPA_REGION_7        -5.991e-02  2.809e-02  -2.133 0.032965 *  
## EPA_REGION_8         1.318e-01  3.842e-02   3.430 0.000607 ***
## EPA_REGION_9         5.293e-01  5.605e-02   9.443  < 2e-16 ***
## EPA_REGION_10        2.714e-01  4.734e-02   5.733 1.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6507 on 7060 degrees of freedom
## Multiple R-squared:  0.3472, Adjusted R-squared:  0.3456 
## F-statistic: 220.9 on 17 and 7060 DF,  p-value: < 2.2e-16

The stepwise regression yields a model with an \(R^2\) value of 0.347. The variables are statistically significant if we adjust our p-value threshold to 0.1. The VIF table below shows that this model is not adversely affected by multicolinearity. The F-statistic also checks out the model.

VIF
x
LOG_POP10 2.330
MEDINC09 1.860
NET_POP_DIFFERENCE 1.754
ALAND 1.776
PRES_RES_REC_TRMT 1.293
PRES_N_RES_REC_TRTM 1.135
MS4 1.159
TMDL_dummy 1.116
EPA_REGION_1 1.227
EPA_REGION_2 1.272
EPA_REGION_3 1.412
EPA_REGION_4 1.834
EPA_REGION_6 1.674
EPA_REGION_7 1.668
EPA_REGION_8 1.457
EPA_REGION_9 1.762
EPA_REGION_10 1.237

The residual plots show that the model meets the assumptions for an OLS model. Therefore, we can use this model to attempt to predict the total infrastructure needs despite explaining only 35% of the variance in the dataset. However, there is room for improvement and one area possible area would be removing the intercept. We would still expect that the infrastructure needs should be zero if all the other explanatory variables are also zero.

Regression without an Intercept

The regression’s summary is shown below:

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + 
##     EPA_REGION_1 + EPA_REGION_2 + EPA_REGION_3 + EPA_REGION_4 + 
##     EPA_REGION_6 + EPA_REGION_7 + EPA_REGION_8 + EPA_REGION_9 + 
##     EPA_REGION_10 + GROWING_CITY + 0, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7899 -0.4661  0.0340  0.5182  2.8659 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## PCTWHITE10           1.791e-02  5.698e-04  31.424  < 2e-16 ***
## LOG_POP10            9.312e-01  1.526e-02  61.033  < 2e-16 ***
## MEDINC09             1.532e-06  8.509e-07   1.800 0.071883 .  
## AWATER              -2.197e-11  1.337e-11  -1.644 0.100302    
## NET_POP_DIFFERENCE  -1.908e-06  1.462e-07 -13.051  < 2e-16 ***
## ALAND                3.140e-12  3.222e-12   0.974 0.329857    
## PRES_RES_REC_TRMT    2.089e-06  1.104e-07  18.913  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.794e-06  6.172e-07   4.528 6.06e-06 ***
## MS4                  4.324e-02  3.057e-02   1.415 0.157227    
## TMDL_dummy           2.555e-01  4.710e-02   5.425 5.98e-08 ***
## EPA_REGION_1         1.565e-01  4.715e-02   3.319 0.000908 ***
## EPA_REGION_2         1.338e-01  4.390e-02   3.047 0.002320 ** 
## EPA_REGION_3         3.533e-01  3.318e-02  10.648  < 2e-16 ***
## EPA_REGION_4         7.367e-01  3.233e-02  22.789  < 2e-16 ***
## EPA_REGION_6         4.850e-01  3.345e-02  14.501  < 2e-16 ***
## EPA_REGION_7         1.689e-01  3.123e-02   5.409 6.54e-08 ***
## EPA_REGION_8         4.178e-01  4.313e-02   9.687  < 2e-16 ***
## EPA_REGION_9         8.477e-01  6.474e-02  13.093  < 2e-16 ***
## EPA_REGION_10        3.914e-01  5.371e-02   7.287 3.50e-13 ***
## GROWING_CITY        -1.657e-01  2.171e-02  -7.633 2.59e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7338 on 7058 degrees of freedom
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9872 
## F-statistic: 2.721e+04 on 20 and 7058 DF,  p-value: < 2.2e-16

We see that the the performance of the model drastically improves. The model can account for 98.7% of the variance. We can then see if the ANOVA and VIF table and residual plots support this model.

ANOVA
Df Sum Sq Mean Sq F value Pr(>F)
PCTWHITE10 1 280205.432 280205.432 520326.745 0.000
LOG_POP10 1 12020.507 12020.507 22321.449 0.000
MEDINC09 1 45.650 45.650 84.770 0.000
AWATER 1 1.814 1.814 3.368 0.067
NET_POP_DIFFERENCE 1 56.734 56.734 105.352 0.000
ALAND 1 9.586 9.586 17.800 0.000
PRES_RES_REC_TRMT 1 216.571 216.571 402.161 0.000
PRES_N_RES_REC_TRTM 1 8.327 8.327 15.464 0.000
MS4 1 83.614 83.614 155.267 0.000
TMDL_dummy 1 3.530 3.530 6.555 0.010
EPA_REGION_1 1 2.369 2.369 4.400 0.036
EPA_REGION_2 1 9.736 9.736 18.080 0.000
EPA_REGION_3 1 3.376 3.376 6.269 0.012
EPA_REGION_4 1 149.617 149.617 277.830 0.000
EPA_REGION_6 1 43.997 43.997 81.699 0.000
EPA_REGION_7 1 1.134 1.134 2.105 0.147
EPA_REGION_8 1 16.367 16.367 30.394 0.000
EPA_REGION_9 1 74.404 74.404 138.165 0.000
EPA_REGION_10 1 23.676 23.676 43.965 0.000
GROWING_CITY 1 31.377 31.377 58.266 0.000
Residuals 7058 3800.862 0.539 NA NA

The F statistic for some individual variables are not statistically significant despite the overall F-statistic for the model that suggests that the the variables are jointly statistically significant. We can use the VIF table to show if our model suffers from multicollinearity.

VIF table:

VIF
x
PCTWHITE10 30.748
LOG_POP10 72.559
MEDINC09 33.504
AWATER 1.190
NET_POP_DIFFERENCE 1.690
ALAND 2.785
PRES_RES_REC_TRMT 1.379
PRES_N_RES_REC_TRTM 1.145
MS4 46.146
TMDL_dummy 1.162
EPA_REGION_1 1.284
EPA_REGION_2 1.356
EPA_REGION_3 1.582
EPA_REGION_4 2.396
EPA_REGION_6 2.057
EPA_REGION_7 1.903
EPA_REGION_8 1.540
EPA_REGION_9 1.915
EPA_REGION_10 1.296
GROWING_CITY 4.410

The VIF values for some variables are really high suggesting a severe multicollinearity problem in this model. The residual plots also show a clear trend, suggesting that this model cannot be used. We will first attempt stepwise regression to see if we can fix this problem.

Stepwise Regression

Here is a summary of the stepwise regression:

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + PRES_RES_REC_TRMT + 
##     PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + EPA_REGION_1 + EPA_REGION_2 + 
##     EPA_REGION_3 + EPA_REGION_4 + EPA_REGION_6 + EPA_REGION_7 + 
##     EPA_REGION_8 + EPA_REGION_9 + EPA_REGION_10 + GROWING_CITY - 
##     1, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7883 -0.4672  0.0335  0.5181  2.8370 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## PCTWHITE10           1.795e-02  5.682e-04  31.591  < 2e-16 ***
## LOG_POP10            9.331e-01  1.513e-02  61.662  < 2e-16 ***
## MEDINC09             1.337e-06  8.271e-07   1.617 0.105998    
## AWATER              -1.972e-11  1.317e-11  -1.498 0.134259    
## NET_POP_DIFFERENCE  -1.875e-06  1.421e-07 -13.192  < 2e-16 ***
## PRES_RES_REC_TRMT    2.083e-06  1.103e-07  18.888  < 2e-16 ***
## PRES_N_RES_REC_TRTM  2.772e-06  6.167e-07   4.495 7.08e-06 ***
## MS4                  4.500e-02  3.052e-02   1.474 0.140403    
## TMDL_dummy           2.543e-01  4.708e-02   5.402 6.80e-08 ***
## EPA_REGION_1         1.610e-01  4.692e-02   3.431 0.000605 ***
## EPA_REGION_2         1.355e-01  4.387e-02   3.088 0.002020 ** 
## EPA_REGION_3         3.523e-01  3.316e-02  10.622  < 2e-16 ***
## EPA_REGION_4         7.342e-01  3.222e-02  22.783  < 2e-16 ***
## EPA_REGION_6         4.856e-01  3.344e-02  14.521  < 2e-16 ***
## EPA_REGION_7         1.690e-01  3.123e-02   5.412 6.44e-08 ***
## EPA_REGION_8         4.309e-01  4.095e-02  10.523  < 2e-16 ***
## EPA_REGION_9         8.752e-01  5.823e-02  15.030  < 2e-16 ***
## EPA_REGION_10        4.033e-01  5.231e-02   7.709 1.45e-14 ***
## GROWING_CITY        -1.652e-01  2.171e-02  -7.610 3.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7338 on 7059 degrees of freedom
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9872 
## F-statistic: 2.864e+04 on 19 and 7059 DF,  p-value: < 2.2e-16

The ANOVA table and VIF table are shown below:

ANOVA table
Df Sum Sq Mean Sq F value Pr(>F)
PCTWHITE10 1 280205.432 280205.432 520330.461 0.000
LOG_POP10 1 12020.507 12020.507 22321.608 0.000
MEDINC09 1 45.650 45.650 84.770 0.000
AWATER 1 1.814 1.814 3.368 0.067
NET_POP_DIFFERENCE 1 56.734 56.734 105.353 0.000
PRES_RES_REC_TRMT 1 213.034 213.034 395.596 0.000
PRES_N_RES_REC_TRTM 1 7.966 7.966 14.793 0.000
MS4 1 89.388 89.388 165.990 0.000
TMDL_dummy 1 3.225 3.225 5.988 0.014
EPA_REGION_1 1 1.963 1.963 3.646 0.056
EPA_REGION_2 1 9.971 9.971 18.516 0.000
EPA_REGION_3 1 2.479 2.479 4.604 0.032
EPA_REGION_4 1 127.810 127.810 237.338 0.000
EPA_REGION_6 1 31.556 31.556 58.599 0.000
EPA_REGION_7 1 0.035 0.035 0.065 0.799
EPA_REGION_8 1 27.952 27.952 51.906 0.000
EPA_REGION_9 1 104.256 104.256 193.599 0.000
EPA_REGION_10 1 26.350 26.350 48.931 0.000
GROWING_CITY 1 31.185 31.185 57.909 0.000
Residuals 7059 3801.373 0.539 NA NA

VIF Table:

VIF
x
PCTWHITE10 30.568
LOG_POP10 71.376
MEDINC09 31.660
AWATER 1.155
NET_POP_DIFFERENCE 1.597
PRES_RES_REC_TRMT 1.376
PRES_N_RES_REC_TRTM 1.143
MS4 45.987
TMDL_dummy 1.161
EPA_REGION_1 1.271
EPA_REGION_2 1.354
EPA_REGION_3 1.581
EPA_REGION_4 2.381
EPA_REGION_6 2.056
EPA_REGION_7 1.903
EPA_REGION_8 1.389
EPA_REGION_9 1.549
EPA_REGION_10 1.230
GROWING_CITY 4.407

The best model from stepAIC yields variables that have high multicollinearity therefore, our stepwise regression turned out to be a dead end.

3.1.3 Models with interaction terms

The last types of models that we can consider is adding interaction terms in the models. The interactions I will consider will be:
1. Growing city x Population 2010
+ I want to investigate the combined effect of a growing city on a population
2. MS4 x AWATER
+ Could a large area under water coupled with a sewer system have an impact on the infrastructure needs
3. TMDL x PRES_RES_REC_TRMT
4. TMDL x AWATER
5. MS4 x NET_POP_DIFFERENCE
6. EPA_REGION_3 x NET_POP_DIFFERENCE
+ This is specific to the region that Philadelphia is located. Does the population in Region 3 have an effect?
7. EPA_REGION_3 X MS4 + Is there an interaction effect between the region and the choice of sewer system?

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + 
##     EPA_REGION_1 + EPA_REGION_2 + EPA_REGION_3 + EPA_REGION_4 + 
##     EPA_REGION_6 + EPA_REGION_7 + EPA_REGION_8 + EPA_REGION_9 + 
##     EPA_REGION_10 + MS4 * AWATER + TMDL_dummy * PRES_RES_REC_TRMT + 
##     GROWING_CITY + GROWING_CITY * LOG_POP10 + MS4 * AWATER + 
##     TMDL_dummy * AWATER + MS4 * NET_POP_DIFFERENCE + EPA_REGION_3 * 
##     MS4 + EPA_REGION_3 * MS4, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4951 -0.4411  0.0202  0.4510  2.2200 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.606e+00  1.651e-01  33.950  < 2e-16 ***
## PCTWHITE10                   -1.806e-04  6.554e-04  -0.276 0.782889    
## LOG_POP10                     3.642e-01  2.770e-02  13.150  < 2e-16 ***
## MEDINC09                      2.698e-06  7.704e-07   3.502 0.000465 ***
## AWATER                       -4.110e-10  1.256e-10  -3.271 0.001075 ** 
## NET_POP_DIFFERENCE            4.686e-06  1.719e-06   2.726 0.006420 ** 
## ALAND                        -6.776e-12  2.871e-12  -2.360 0.018307 *  
## PRES_RES_REC_TRMT             2.264e-06  1.047e-07  21.617  < 2e-16 ***
## PRES_N_RES_REC_TRTM           1.538e-06  5.587e-07   2.752 0.005934 ** 
## MS4                          -6.563e-01  3.779e-02 -17.369  < 2e-16 ***
## TMDL_dummy                    2.144e-01  4.399e-02   4.873 1.13e-06 ***
## EPA_REGION_1                  2.076e-01  4.238e-02   4.898 9.88e-07 ***
## EPA_REGION_2                  1.165e-01  3.901e-02   2.986 0.002838 ** 
## EPA_REGION_3                 -2.447e-01  1.383e-01  -1.769 0.076858 .  
## EPA_REGION_4                  3.550e-01  3.014e-02  11.779  < 2e-16 ***
## EPA_REGION_6                  1.051e-01  3.119e-02   3.370 0.000756 ***
## EPA_REGION_7                 -4.396e-02  2.849e-02  -1.543 0.122940    
## EPA_REGION_8                  1.465e-01  3.891e-02   3.765 0.000168 ***
## EPA_REGION_9                  5.298e-01  5.804e-02   9.128  < 2e-16 ***
## EPA_REGION_10                 2.701e-01  4.787e-02   5.641 1.75e-08 ***
## GROWING_CITY                  2.241e-01  1.531e-01   1.464 0.143357    
## AWATER:MS4                    2.089e-10  6.372e-11   3.278 0.001051 ** 
## PRES_RES_REC_TRMT:TMDL_dummy -7.201e-07  2.360e-07  -3.052 0.002285 ** 
## LOG_POP10:GROWING_CITY       -4.817e-02  3.351e-02  -1.437 0.150678    
## AWATER:TMDL_dummy             3.252e-11  2.888e-11   1.126 0.260159    
## NET_POP_DIFFERENCE:MS4       -2.193e-06  8.621e-07  -2.544 0.010969 *  
## MS4:EPA_REGION_3              2.631e-01  7.340e-02   3.584 0.000341 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6493 on 7051 degrees of freedom
## Multiple R-squared:  0.351,  Adjusted R-squared:  0.3486 
## F-statistic: 146.6 on 26 and 7051 DF,  p-value: < 2.2e-16

We obtain a model with a statistically significant \(R^2\) value of 0.351. However, there are variables whose p-values suggest that we can possibly improve the model.

Below is the VIF table

VIF
x
PCTWHITE10 1.660
LOG_POP10 5.453
MEDINC09 2.081
AWATER 124.280
NET_POP_DIFFERENCE 273.680
ALAND 1.849
PRES_RES_REC_TRMT 1.504
PRES_N_RES_REC_TRTM 1.190
MS4 1.784
TMDL_dummy 1.243
EPA_REGION_1 1.267
EPA_REGION_2 1.295
EPA_REGION_3 31.270
EPA_REGION_4 2.197
EPA_REGION_6 1.965
EPA_REGION_7 1.724
EPA_REGION_8 1.501
EPA_REGION_9 1.898
EPA_REGION_10 1.271
GROWING_CITY 80.811
AWATER:MS4 124.770
PRES_RES_REC_TRMT:TMDL_dummy 1.392
LOG_POP10:GROWING_CITY 101.220
AWATER:TMDL_dummy 1.321
NET_POP_DIFFERENCE:MS4 272.430
MS4:EPA_REGION_3 31.252

The model has high collinearity, which is to be expected since there are a lot of interaction terms. We can utilize stepwise regression to obtain a better model.

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ LOG_POP10 + MEDINC09 + 
##     AWATER + NET_POP_DIFFERENCE + ALAND + PRES_RES_REC_TRMT + 
##     PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + EPA_REGION_1 + EPA_REGION_2 + 
##     EPA_REGION_3 + EPA_REGION_4 + EPA_REGION_6 + EPA_REGION_7 + 
##     EPA_REGION_8 + EPA_REGION_9 + EPA_REGION_10 + GROWING_CITY + 
##     AWATER:MS4 + PRES_RES_REC_TRMT:TMDL_dummy + LOG_POP10:GROWING_CITY + 
##     NET_POP_DIFFERENCE:MS4 + MS4:EPA_REGION_3, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4960 -0.4414  0.0208  0.4504  2.2205 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.592e+00  1.469e-01  38.071  < 2e-16 ***
## LOG_POP10                     3.644e-01  2.739e-02  13.306  < 2e-16 ***
## MEDINC09                      2.674e-06  7.675e-07   3.484 0.000497 ***
## AWATER                       -4.164e-10  1.255e-10  -3.318 0.000912 ***
## NET_POP_DIFFERENCE            4.689e-06  1.718e-06   2.729 0.006375 ** 
## ALAND                        -6.936e-12  2.868e-12  -2.419 0.015603 *  
## PRES_RES_REC_TRMT             2.267e-06  1.043e-07  21.724  < 2e-16 ***
## PRES_N_RES_REC_TRTM           1.548e-06  5.579e-07   2.775 0.005537 ** 
## MS4                          -6.586e-01  3.772e-02 -17.460  < 2e-16 ***
## TMDL_dummy                    2.231e-01  4.330e-02   5.151 2.66e-07 ***
## EPA_REGION_1                  2.076e-01  4.237e-02   4.899 9.85e-07 ***
## EPA_REGION_2                  1.166e-01  3.889e-02   2.999 0.002717 ** 
## EPA_REGION_3                 -2.470e-01  1.382e-01  -1.786 0.074080 .  
## EPA_REGION_4                  3.589e-01  2.796e-02  12.838  < 2e-16 ***
## EPA_REGION_6                  1.094e-01  2.927e-02   3.738 0.000187 ***
## EPA_REGION_7                 -4.214e-02  2.845e-02  -1.481 0.138620    
## EPA_REGION_8                  1.493e-01  3.871e-02   3.856 0.000116 ***
## EPA_REGION_9                  5.345e-01  5.631e-02   9.493  < 2e-16 ***
## EPA_REGION_10                 2.718e-01  4.777e-02   5.689 1.33e-08 ***
## GROWING_CITY                  2.252e-01  1.526e-01   1.475 0.140136    
## AWATER:MS4                    2.152e-10  6.346e-11   3.391 0.000701 ***
## PRES_RES_REC_TRMT:TMDL_dummy -7.246e-07  2.359e-07  -3.071 0.002141 ** 
## LOG_POP10:GROWING_CITY       -4.832e-02  3.343e-02  -1.445 0.148428    
## NET_POP_DIFFERENCE:MS4       -2.194e-06  8.619e-07  -2.546 0.010914 *  
## MS4:EPA_REGION_3              2.652e-01  7.335e-02   3.615 0.000302 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6492 on 7053 degrees of freedom
## Multiple R-squared:  0.3508, Adjusted R-squared:  0.3486 
## F-statistic: 158.8 on 24 and 7053 DF,  p-value: < 2.2e-16

The stepwise regression yielded a very small improvement over the first model. We will try to implement regression with no-intercept to see if it offers an improvement over the previous model.

## 
## Call:
## lm(formula = LOG_TOTAL_OFFICIAL_NEED ~ PCTWHITE10 + LOG_POP10 + 
##     MEDINC09 + AWATER + NET_POP_DIFFERENCE + ALAND + AWATER + 
##     PRES_RES_REC_TRMT + PRES_N_RES_REC_TRTM + MS4 + TMDL_dummy + 
##     EPA_REGION_1 + EPA_REGION_2 + EPA_REGION_3 + EPA_REGION_4 + 
##     EPA_REGION_6 + EPA_REGION_7 + EPA_REGION_8 + EPA_REGION_9 + 
##     EPA_REGION_10 + MS4 * AWATER + TMDL_dummy * PRES_RES_REC_TRMT + 
##     GROWING_CITY + GROWING_CITY * LOG_POP10 + 0, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0381 -0.4530  0.0292  0.4953  2.2943 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## PCTWHITE10                    1.069e-02  6.261e-04  17.070  < 2e-16 ***
## LOG_POP10                     1.125e+00  1.727e-02  65.121  < 2e-16 ***
## MEDINC09                      5.189e-06  8.335e-07   6.226 5.07e-10 ***
## AWATER                        1.357e-10  1.334e-10   1.018   0.3089    
## NET_POP_DIFFERENCE            1.278e-08  1.619e-07   0.079   0.9371    
## ALAND                        -4.323e-12  3.115e-12  -1.388   0.1653    
## PRES_RES_REC_TRMT             2.215e-06  1.133e-07  19.559  < 2e-16 ***
## PRES_N_RES_REC_TRTM           3.069e-06  6.039e-07   5.081 3.85e-07 ***
## MS4                          -1.388e-01  3.239e-02  -4.287 1.84e-05 ***
## TMDL_dummy                    2.468e-01  4.703e-02   5.248 1.58e-07 ***
## EPA_REGION_1                  2.585e-01  4.600e-02   5.620 1.98e-08 ***
## EPA_REGION_2                  1.768e-01  4.234e-02   4.177 2.99e-05 ***
## EPA_REGION_3                  3.026e-01  3.199e-02   9.459  < 2e-16 ***
## EPA_REGION_4                  6.354e-01  3.140e-02  20.237  < 2e-16 ***
## EPA_REGION_6                  3.974e-01  3.244e-02  12.250  < 2e-16 ***
## EPA_REGION_7                  1.654e-01  3.009e-02   5.496 4.02e-08 ***
## EPA_REGION_8                  3.669e-01  4.156e-02   8.829  < 2e-16 ***
## EPA_REGION_9                  8.043e-01  6.237e-02  12.895  < 2e-16 ***
## EPA_REGION_10                 4.205e-01  5.173e-02   8.130 5.03e-16 ***
## GROWING_CITY                  3.049e+00  1.379e-01  22.118  < 2e-16 ***
## AWATER:MS4                   -7.689e-11  6.737e-11  -1.141   0.2538    
## PRES_RES_REC_TRMT:TMDL_dummy -5.067e-07  2.561e-07  -1.979   0.0479 *  
## LOG_POP10:GROWING_CITY       -6.945e-01  2.946e-02 -23.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7057 on 7055 degrees of freedom
## Multiple R-squared:  0.9882, Adjusted R-squared:  0.9881 
## F-statistic: 2.561e+04 on 23 and 7055 DF,  p-value: < 2.2e-16
x
PCTWHITE10 40.138
LOG_POP10 100.540
MEDINC09 34.766
AWATER 128.130
NET_POP_DIFFERENCE 2.240
ALAND 2.816
PRES_RES_REC_TRMT 1.569
PRES_N_RES_REC_TRTM 1.185
MS4 56.030
TMDL_dummy 1.253
EPA_REGION_1 1.322
EPA_REGION_2 1.364
EPA_REGION_3 1.590
EPA_REGION_4 2.445
EPA_REGION_6 2.092
EPA_REGION_7 1.911
EPA_REGION_8 1.547
EPA_REGION_9 1.922
EPA_REGION_10 1.300
GROWING_CITY 192.280
AWATER:MS4 127.080
PRES_RES_REC_TRMT:TMDL_dummy 1.392
LOG_POP10:GROWING_CITY 221.750
Df Sum Sq Mean Sq F value Pr(>F)
PCTWHITE10 1 280205.432 280205.432 562718.375 0.000
LOG_POP10 1 12020.507 12020.507 24140.004 0.000
MEDINC09 1 45.650 45.650 91.676 0.000
AWATER 1 1.814 1.814 3.643 0.056
NET_POP_DIFFERENCE 1 56.734 56.734 113.935 0.000
ALAND 1 9.586 9.586 19.250 0.000
PRES_RES_REC_TRMT 1 216.571 216.571 434.925 0.000
PRES_N_RES_REC_TRTM 1 8.327 8.327 16.724 0.000
MS4 1 83.614 83.614 167.917 0.000
TMDL_dummy 1 3.530 3.530 7.089 0.008
EPA_REGION_1 1 2.369 2.369 4.758 0.029
EPA_REGION_2 1 9.736 9.736 19.553 0.000
EPA_REGION_3 1 3.376 3.376 6.780 0.009
EPA_REGION_4 1 149.617 149.617 300.465 0.000
EPA_REGION_6 1 43.997 43.997 88.356 0.000
EPA_REGION_7 1 1.134 1.134 2.277 0.131
EPA_REGION_8 1 16.367 16.367 32.870 0.000
EPA_REGION_9 1 74.404 74.404 149.422 0.000
EPA_REGION_10 1 23.676 23.676 47.547 0.000
GROWING_CITY 1 31.377 31.377 63.013 0.000
AWATER:MS4 1 9.141 9.141 18.358 0.000
PRES_RES_REC_TRMT:TMDL_dummy 1 1.930 1.930 3.876 0.049
LOG_POP10:GROWING_CITY 1 276.755 276.755 555.790 0.000
Residuals 7055 3513.035 0.498 NA NA

3.2 Comparison of models

We ran models under 3 general categories: (i) Without categorical variables, (ii) With categorical variables (iii) With interaction effects. Unfortunately, we were unable to generate a model that best captured the variability in the dataset and help us to predict the total infrastructure needs.

However there were explanatory that stood out in all three multivariate variables. I will look at these explanatory variables and refer to theory as to why they may have featured strongly across the models:
1.Log of Population (LOG_POP10): Across all models, this variable had a positive coefficient. It would make sense that as there is increased population, the needs also increase.
2.Residents receiving treatment (PRES_RES_REC_TRMT): Across all models, this variable had a positive coeffcient. Just like population, if there are more people receiving service, there will be more need.
3.MS4: For the models that had MS4 as a variable, the associated coefficient was negative. This is to be expected based on our analysis in the previous homework that showed that there is more need in locations that use a combined sewer system as opposed to a MS4.
4. TMDL: Facilities that emptied into waterbodies have extra infrastructure and regulatory needs hence driving up their costs. It is no surprise then that the TMDL variable had a negative coefficient across all models. 5. Epa Regions: All regions had a positive coefficient with the expection of Region 7. Region 7 had a negative coefficient. This would suggest that needs in region 7 are among the lowest in the country.

4 Report to the Mayor

To the mayor,

The planning department run a host of models to try and pick out a host of variables: demographic, geographic and socioeconomic that could help Philadelphia to predict its infrastructure needs so that it can plan accordingly. Unfortunately, we could not come up with a model that could enable the city to predict it’s infrastructure needs. However, there were some variables of interest that emerged from the models. In the previous report, I had mentioned that the city should consider installing MS4 systems rather than CSS systems based on the fact that these are more efficient and the data suggest that the overall need for the former system is less than the latter system.

A feature that was also apparent was the population served by the facility. Facilities that served huge populations had a corresponding increase in the infrastractural needs. This also translates to other population related variables such as the number of residents and non-residents receiving services from the facilities.

Lastly another unique feature was that facilites in region 7 seemed to have lower infrastructural needs when controlling for other variables in the various models. There could be something to learn from investigating the data on region 7’s facilities compared to region 3.

From the boxplot above, we can see the clear disparity when implementing a CSS sewer system, especially in region 7. Also worth noting is that the need in region 7 is smaller compared to region 3. This is to be expected since region 7 is the midwest and some facilities may serve fewer numbers than region 3. However, the CSS facilities in region 7 seem to have higher infrastructural needs. This is a strong case for building a MS4 system in the future.

There is a huge discrepancy in the TMDL facilities in region 7. There is only facility in the whole region that empties into a waterbody and that is located in Shawnee, Kansas. We can disregard this point and instead compare the MS4 non-TMDL facilities.

A look at the summary statistics for the regions might result in better results.

SUMMARY STATISTICS FOR REGION 3
TOTAL_OFFICIAL_NEED
 AWATER </th>
 ALAND </th>
CSS TMDL_INDICATOR LOG_POP10 PRES_RES_REC_TRMT EPA_REGION PCTWHITE10
Min. :2.184e+04 Min. :1.783e+05 Min. :1.416e+07 0:653 N:770 Min. :3.757 Min. : 0 3 :774 Min. :16.10
1st Qu.:1.480e+06 1st Qu.:7.536e+06 1st Qu.:9.326e+08 1:121 Y: 4 1st Qu.:4.525 1st Qu.: 1092 1 : 0 1st Qu.:85.10
Median :3.797e+06 Median :1.595e+07 Median :1.250e+09 NA NA Median :4.897 Median : 2706 2 : 0 Median :92.00
Mean :2.090e+07 Mean :9.006e+07 Mean :1.408e+09 NA NA Mean :4.916 Mean : 20640 4 : 0 Mean :87.89
3rd Qu.:1.061e+07 3rd Qu.:2.752e+07 3rd Qu.:1.787e+09 NA NA 3rd Qu.:5.280 3rd Qu.: 7990 5 : 0 3rd Qu.:96.40
Max. :1.429e+09 Max. :2.229e+09 Max. :3.182e+09 NA NA Max. :6.184 Max. :1835470 6 : 0 Max. :99.00
NA NA NA NA NA NA NA (Other): 0 NA
SUMMARY STATISTICS FOR REGION 7
TOTAL_OFFICIAL_NEED
 AWATER </th>
 ALAND </th>
CSS TMDL_INDICATOR LOG_POP10 PRES_RES_REC_TRMT EPA_REGION PCTWHITE10
Min. :1.714e+04 Min. : 94406 Min. :1.605e+08 0:1026 N:1050 Min. :2.811 Min. : 0.0 7 :1051 Min. :40.40
1st Qu.:2.806e+05 1st Qu.: 4347639 1st Qu.:1.398e+09 1: 25 Y: 1 1st Qu.:3.992 1st Qu.: 362.5 1 : 0 1st Qu.:89.00
Median :9.529e+05 Median : 11890320 Median :1.501e+09 NA NA Median :4.318 Median : 898.0 2 : 0 Median :95.00
Mean :1.515e+07 Mean : 16707310 Mean :1.727e+09 NA NA Mean :4.385 Mean : 7105.9 3 : 0 Mean :92.31
3rd Qu.:2.716e+06 3rd Qu.: 21138791 3rd Qu.:1.855e+09 NA NA 3rd Qu.:4.664 3rd Qu.: 2513.5 4 : 0 3rd Qu.:97.20
Max. :2.429e+09 Max. :136351823 Max. :1.544e+10 NA NA Max. :6.000 Max. :662376.0 5 : 0 Max. :99.00
NA NA NA NA NA NA NA (Other): 0 NA

We can see that across the board, there summary statistics for region 7 are smaller than for region 3, so it could be a case of a small population failing to exert a lot of pressure on the existing facilities.

Philadelphia could possibly schedule a tour to region 7 to understand their infrastructure challenges and see why their predictions are likely to be lower.

5 APPENDIX

The diagnostic plots for regression are here below:

5.1 Plots for Log of Need and Log of Population

5.2 Plots for Need and PCTWHITE

5.3 Plots for Need and AWATER

5.4 Plots for Need and MEDINC09

5.5 Plots for base Multivariate regression

TOTAL_OFFICIAL_NEED

LOG_TOTAL_OFFICIAL_NEED

STEPWISE PLOT LOG_TOTAL_OFFICIAL_NEED

LOG_TOTAL_OFFICIAL_NEED without intercept term

5.6 Plots for Multivariate Regression with categorical variables

Residual plots of regression with all continuous and categorical variables

Residual plots for best stepwise model with categorical variables

** Plots of regression without intercept terms**

Plots of stepwise regression of nointercept regression

5.7 Plots for Multivariate Regression with interaction Effects

Residual plot for stepwise regression